[Blocked] Use skrub for data cleaning #218
Fixes issue #138: NA handling in text columns

- Add skrub>=0.3.0 dependency to handle mixed string/NA data
- Integrate TableVectorizer in TabPFNClassifier to properly process text columns with NA values
- Add test to verify the solution works as expected
| Okay, we ran into a problem: skrub 0.3.0 requires scipy 1.9.3, which isn't compatible with TabPFN. | 
| Does it fail without  | 
| I've simplified the implementation to only rely on TableVectorizer without needing the extra  function. Also bumped scikit-learn minimum version to 1.2.1 for compatibility with skrub. Note that scikit-learn 1.2.1 was released in January 2023, so it's still more than 2 years old and should be a reasonable dependency. Same for pandas 1.5.3. | 
| Could we use AutoGluon's AutoMLPipelineFeatureGenerator instead? | 
Fix #138: NA handling in text columns
Fix #163
Partially fixed by #242